Ecology of Middle East respiratory syndrome coronavirus, 2012–2020: A machine learning modelling analysis

Abstract The ongoing enzootic circulation of the Middle East respiratory syndrome coronavirus (MERS‐CoV) in the Middle East and North Africa is increasingly raising the concern about the possibility of its recombination with other human‐adapted coronaviruses, particularly the pandemic SARS‐CoV‐2. We aim to provide an updated picture about ecological niches of MERS‐CoV and associated socio‐environmental drivers. Based on 356 confirmed MERS cases with animal contact reported to the WHO and 63 records of animal infections collected from the literature as of 30 May 2020, we assessed ecological niches of MERS‐CoV using an ensemble model integrating three machine learning algorithms. With a high predictive accuracy (area under receiver operating characteristic curve = 91.66% in test data), the ensemble model estimated that ecologically suitable areas span over the Middle East, South Asia and the whole North Africa, much wider than the range of reported locally infected MERS cases and test‐positive animal samples. Ecological suitability for MERS‐CoV was significantly associated with high levels of bareland coverage (relative contribution = 30.06%), population density (7.28%), average temperature (6.48%) and camel density (6.20%). Future surveillance and intervention programs should target the high‐risk populations and regions informed by updated quantitative analyses.

. In February 2018, WHO formally incorporated MERS-CoV into the Research and Development Blueprint to promote research in this area (Mehand et al., 2018). The ongoing pandemic of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) further exacerbated this concern due to the possibility of recombination.
Current epidemiological studies suggest that human-to-human transmission of MERS-CoV is inefficient, and the primary infection source is zoonotic (Anthony et al., 2017;David et al., 2018). While the number and size of outbreaks have decreased in recent years, the proportion of primary MERS-CoV cases is gradually increasing . A major driver for zoonotic infections is exposure to dromedaries or consumption of their raw products such as milk (Anthony et al., 2017;Chan et al., 2014;David et al., 2018), although other mammals may also serve as the reservoir (Alraddadi et al., 2016;David et al., 2018;Hui et al., 2018;Killerby et al., 2020). A recent study showed a significant correlation between primary MERS-CoV cases and camel clusters at the provincial level in the KSA (Al-Ahmadi et al., 2020). Besides, MERS-CoV-specific antibodies and MERS-CoV RNA have been widely detected in camels across the Middle East and North, West and East Africa (Dighe et al., 2019;FAO-OIE-WHO MERS Technical Working Group, 2018;Sikkema et al., 2019). However, few studies have assessed the ecological suitability of the virus worldwide at a refined spatial resolution. The presence of an animal reservoir is a necessary but not sufficient condition for zoonotic diseases, as many other factors are needed for sustained spillover or endemic transmis-sion. Similarly, reporting of opportunistic human cases does not imply establishment of an ecological niche as these sporadic cases could be imported. A rigorous assessment of the ecology of MERS-CoV at the global scale is urgently needed to improve surveillance in high-risk areas in preparation for a potential pandemic.

Feature imputation and selection
We obtained 34 variables (33 district-level data and camel density data at district level after imputing) from all data sources (Table S2)  We collected camel density data from three sources: (1) camel density data at the district level from 2012 to 2020 were obtained from the World Organization for Animal Health (OIE; https://wahis.oie.int/#/ dashboards/qd-dashboard); (2) camel count data at the district level in the KSA were obtained from a national agricultural census conducted by the General Authority for Statistics of the KSA (https://www.stats. gov.sa/en/22); and (3) for some countries without district-level data, the total numbers of camels at the country level were obtained from the FAO (http://www.fao.org/faostat/en). The distribution of camel data availability is shown in Figure S3a. As camel is a major reservoir of the MERS-CoV, we imputed missing camel densities at the district level (region either without camel density data or with only the total number of camels at the country level) using a Classification and Regression Trees (CART) model (Supporting Information).

Model development
The selected features were assessed for their association with the presence/absence of MERS human cases or MERS-CoV-positive animal specimens using several ML algorithms that have been widely applied in ecological studies (Bhatt et al., 2013;Fang et al., 2013;Shearer et al., 2018). In this study, a 'case-control' design was constructed to assess ecological features for their impacts on the presence/absence of MERS human cases or MERS-CoV-positive animal specimens at the district level. Before the model fitting, multicollinearity among features was screened via the Spearman rank correlation coefficient. For each group of highly correlated variables (correlation coefficient ≥.7), only one variable was chosen as the proxy to be used for fitting models ( Figure S4).
A two-stage model was developed in this study. In the first stage, we fitted three mainstream ML models including random forest (RF), boosted regression trees (BRT) and support vector machine (SVM) to the data, using R packages randomForest, gbm and e1071. For each model, the following bootstrapping-and-fitting steps were replicated 100 times, each using one of the 100 imputed data sets of camel densities. At first, to improve the robustness of the model and to avoid over fitting, we bootstrapped 100 'case' districts from the overall pool of 117 'case' districts reporting MERS cases or MERS-CoV-positive animal specimens. For each selected 'case' , we bootstrapped four 'controls' from the pool of 1820 'control' districts reporting neither MERS case nor positive animal specimen, to reach a total of 400 (22%) controls.
The number of cases, 100, was chosen to be close to the total number of available case districts while leaving some space for randomness. The 1:4 case-control ratio is based on our previous ecological studies for avian influenza virus A (H7N9) (Fang et al., 2013) and severe fever with thrombocytopenia syndrome (SFTS) (Liu et al., 2015). Features that are time varying, for example, population density and meteorological variables, were averaged over the study period. Next, the bootstrap data set was randomly partitioned into a 75% training data set and a 25% testing data set. The base models were then built using the training data set and validated using the testing data set. Finally, the fitted models were used to predict the probability of MERS-CoV presence for all districts in our data.
In the second stage, we built an ensemble model by stacking the three base ML models. Stacking is one of the ensemble learning strategies that became popular because of their more robust prediction than individual learners (Breiman, 1996;Wolpert, 1992). The modelbuilding procedure involves two steps. In the first step, all base ML models were trained on the training data set with three fold crossvalidation, from which we obtained a meta training data set with predicted probabilities as columns, one per each base model. The trained base models were also applied to the test data set to generate a metatest data set, also with predicted probabilities as columns, one per each base model. In the second step, we fitted an XGBoost model, called the meta-learner, to the meta-training data set, using the predicted probabilities from the base models as features. XGBoost is a scalable gradient tree boosting algorithm which improves the efficiency of split finding with weighted quantile sketch and is able to handle sparsity in features (Chen & Guestrin, 2016). It has been used for disease diagnosis and prediction (Davagdorj et al., 2020;Guan, et al., 2021;Ogunleye & Wang, 2020). The predictive performance of the meta-learner was evaluated using the meta test data. Considering that our study is essentially based on presence-only data as absence of MERS-CoV in a given area is much harder to prove, a sensitivity analysis was conducted using the maximum entropy (MaxEnt) model, a popular tool for modelling presenceonly data (Li et al., 2011).

Model evaluation and risk mapping
Predictive

Other statistical analyses
Comparisons between groups were conducted using Wilcoxon ranksum test for continuous variables and Fisher's exact test for categorical variables. Using predictive features found by the best ML model, we performed logistic regression on the presence of MERS-CoV to evaluate their effects in terms of odds ratios (OR). All analysis was performed in R 3.6.2. All statistical tests were two-sided at the significance level of .05.  (Table S3). Compared to those without animal contact, these cases were significantly older, had more underlying conditions, had longer delay in diagnosis and were associated with higher CFR (Table S3).

From
As camel density is an important risk driver based on current knowledge but not available in many places, we performed statistical imputation of this variable. CART provided the highest R 2 for fitting the regression of camel density on other completely observed variables ( Figure S2) and was thus used to generate 100 complete data sets for further modelling. The imputed camel densities match the observed data satisfactorily (Figures 2b and S3). inputs for the ensemble model ( Figure S6). In addition to the highest AUC, the ensemble model was also associated with the highest accuracy (93.95%, 95% CI: 92.79%-94.93%) and F1 score (0.54). RF provided a similar F1 score.
We mapped the average predicted probabilities of MERS-CoV presence given by the ensemble model (Figure 2c). High and moderate risk areas span over the Middle East, West Asia and the whole North Africa, much wider than the geographic range with human MERS cases recorded. Southern Europe, eastern Africa and southern Africa were predicted to have mild risks, consistent with the observation of only a few human cases or positive animal samples in these regions. The risk map obtained from the maximum entropy MaxEnt model, another popular method for presence-only data, also showed a comparable distribution ( Figure S7). We estimated that 124.1 (95% CI: 110.7−137.6) million people live in high-risk areas, accounting for 2% of the total global population. Uncertainty in the spatial predictions seems to be high in many high-risk districts in northern Africa (Figure 2d), possibly due to lack of human cases.

Variable importance
Based on the BRT model, we found that bareland coverage was the leading contributor to the risk of MERS-CoV presence, with a relative contribution (RC) of 30.06% (95% CI: 28.61%−31.50%), followed by coverage of forest land with an RC of 10.74% (95% CI: 9.56%−11.91%) ( Figure 3a and Table S4). Population density, annual mean temperature (Bio1), coverage of cropland and camel density had moderate RCs, ranging from 6.2% to 7.28%. The higher risk of MERS-CoV presence was associated with higher levels of bareland coverage, population density, annual mean temperature and camel density, but was associated with lower levels of forest coverage and cropland coverage ( Figure 3b). The model-based findings regarding the effects of the risk factors are mostly consistent with the univariable analyses for dichotomized factors except for population density (Figure 3c and Table S5). Among the top nine predictors selected by the BRT model, eight were also classified among the top nine contributors by the RF model. Besides, bareland coverage was also ranked the most important by the RF, SVM and MaxEnt models (Table S4 and Figure S7).

Logistic analysis based on extracted important variables
To interpret the effects of influential risk predictors using more traditional odds ratios, we conducted a logistic regression with the seven predictors (bareland coverage, forest coverage, population density, Bio1, cropland coverage, Bio5 and camel density) that have relative contributions over 5% in the BRT model (Table 2). These predictors were dichotomized based on their values associated with the best Youden index ( Figure S8), except that population density was categorized into three levels based on inflection points (6.05 and 181.3) of the BRT-estimated risk curve (Figure 3b). Forest coverage and cropland coverage were excluded from the multivariable logistic analysis for their high correlations with bareland coverage (P > .6). Bareland coverage, Bio1 and camel density were significantly associated with presence of MERS-CoV in both the univariable and multivariable analyses (Table 2). Bareland (>8.12% vs. ≤8.12%) was associated with an OR of 23.74 (95% CI: 12.93-43.58), and Bio1 (>20.48°C vs. ≤20.48°C) F I G U R E 3 Contributions of key socioenvironmental factors to ecological suitability of MERS-CoV identified by the BRT model. (a) The relative contributions of 20 variables included in the BRT model. Points represent the mean values, and bars represent the 95% confidence intervals. High (≥10%) and moderate (≥5%, <10%) relative contributions are coloured in red and blue, respectively. (b) Risk curves of the top nine influential factors. A higher risk value corresponds to a better ecological suitability for the virus. The distribution of each factor in the data is shown as the histograms in orange. (c) Box plots showing difference in the distribution of the top nine contributors between case districts (n = 117) and the pool of all control districts (n = 1820). Extreme values are removed. The centre line represents the median, and the box represent the inter-quartile range. P-values are based on Wilcoxon rank sum test except two-sample t-test is used for bareland coverage was associated with an OR of 4.05 (95% CI: 2.13-7.69). Furthermore, no significant two-way interaction was found among bareland coverage, population density, Bio1 and camel density. Abbreviations: CI, confidence interval; OR, odds ratio. a Percentage coverage of forest and cropland were not included in multivariate analysis for their high correlations with bareland coverage (P > .6). b Univariable analysis of camel density and multivariable analyses for all variables are averaged over 100 data sets with imputed camel densities. Hemida et al., 2017;Miguel et al., 2017). A recent observational study on a cohort with occupational exposure to dromedary camels in Nigeria supports the conjecture that MERS-CoV infections in Africa have been severely underestimated (Mok et al., 2020). Even in the Middle East where MERS-CoV is known to be enzootic, zoonotic infections could have also been under-detected as cases who reported animal contact account for a relatively small proportion of all reported cases (Ye et al., 2020). The risk map could help us with prioritizing resource-limited high-risk areas for surveillance, prevention and control efforts.

DISCUSSION
We found that both bareland coverage and forest coverage are the leading predictors for the ecological niche of MERS-CoV. The effect of bareland coverage is stable and strong in both the univariable and multivariable analyses. The importance of bareland in the transmission of MERS-CoV might be related to abundance of camels (Mirkena et al., 2018;Sikkema et al., 2019). However, the leading role of bareland coverage is robust to the adjustment for both camel density and temperature, suggesting the possibility that bareland might be ecologically suitable for a variety of other animal hosts not considered in our models, for example certain types of bats and alpacas (Hui et al., 2018;Killerby et al., 2020). A cross-continental fine-scale risk assessment study of ecological niches of the MERS was carried out based on multi-elements, which has not been reported in the current reported literatures. The risk map can be used to guide future surveillance of MERS-CoV in both animal and human hosts, especially in resource-limited countries. One lesson we learned from the pandemic of COVID-19 is that more active surveillance on the animal-human interface is crucial for early warning of evolutionary signatures of the virus that are indicative of its transmission efficiency both from animal to human and from human to human, so that potential pandemic can be prevented or controlled at the source. The risk map can also inform selection of sites for clinical trials to evaluate effectiveness of intervention products or programs such as candidate vaccines against MERS-CoV (Folegatti et al., 2020).